
clear all
set more off

***************************************************************************************************
*********************Get student demographic data************************************************
***************************************************************************************************
*Student demographics that I will gather:  gender, ethnicity, age, EL, FRL, parental education, zipcode, school/grade

*Inputs: the student demographic files from 2003-2017
*1. Student demographic files from 2003-2017 (note that I use the demographics from the fall enrollment report and only use Spring reports to grab school of attendance)
*2. "VA_test_scores.dta" (Created by .do file "1 - Get_Test_Data.do")
*3. "/lausd_distance_chars_bound_data.dta": This is a manually generated data (data provided by author either upon request or on the Data Archive Site) that crucially contains schools' distance to gas leak along with some school characteristics (e.g., charter status, etc) 

*Outputs:
*1. "VA_demographics.dta"
*2. "main_analysis_data.dta"
*3. Table 1 (summary stats)
***************************************************************************************************
***************************************************************************************************
***************************************************************************************************

*One thing to consider: demographics are reported for both fall (DECEMBER) and spring (JUNE).  
*Current solution: use fall (December) as gospel, only grabbing school location from JUNE file

*Note that the file format often changes, hence this is done in "blocks" of years with similar formats

*Start with 2015 and 2017 which have near-identical formats
foreach n of numlist 5(1)7{
clear all
import delimited "/data/Demographics/Demographics 201`n'.csv"

*Rename variables
ren ÿþendyear schendyr 
ren occurencedescription occoccurrencedescr
ren schoolcdscode schcdscode
ren preferredlocationcode prlpreferredlocationcode
ren preferredlocationname prlpreferredlocationname
ren schoollocationcode schlocationcode 
ren trackcode trctrackcode
ren studentpseudoid stdpseudoid 
ren studentdateofbirth stddob
ren ethnicitydescription ethnicitydescr
ren languageclasscode langclscode 
ren homelanguagedescription homelangdescr
ren povertyindicatorcode frlunch
ren  studentgradefristenrolled stdgradefirstenroll
ren studentcaschoolfirstattenddate stdca1stschooldate
ren migrantindicatorcode migrant
ren ell_level_code ellevelcode
ren studentellredesignationdate stdelredesignationdate 
ren resschcode schreslocationcode

if `n'<=6{
ren zipcode stdzipcode
}
else if `n'==7{
ren homezipcode stdzipcode
}

*Make these strings (so can merge across other years)
tostring prlpreferredlocationcode schlocationcode schreslocationcode stdzipcode, replace

*Drop uneeded variables
drop trctrackcode stdgradefirstenroll stdca1stschooldate homelessindicatorcode migrant leavecode leavedate

*Get Spring school location
preserve
keep if occoccurrencedescr=="JUNE"
keep stdpseudoid prlpreferredlocationcode schlocationcode
ren prlpreferredlocationcode prlprefcode_spring
ren schlocationcode  schcode_spring
*Just one duplicate which is actual duplicates, so drop with force command
duplicates drop stdpseudoid, force
save "/data_analysis/Eliso_Complete/Data/temp.dta", replace
restore

*Merge in spring school location
keep if occoccurrencedescr=="DECEMBER"
merge 1:1 stdpseudoid using "/data_analysis/Eliso_Complete/Data/temp.dta"
drop if _merge==2
drop _merge
erase "/data_analysis/Eliso_Complete/Data/temp.dta"

save "/data_analysis/Eliso_Complete/Data/Demographics 201`n'.dta", replace
}


*Now do years 2003 and 2006*
*For these demographic files you restrict to the "NORM" collection as it has the most observations 
foreach n of numlist 3 6{
clear all
import delimited "/data/Demographics/Demographics 200`n'.txt"
keep if occoccurrencedescr=="NORM FALL" | occoccurrencedescr=="NORM SPRING"

tostring prlpreferredlocationcode schlocationcode schreslocationcode stdzipcode, replace

drop trctrackcode stdgradefirstenroll stdca1stschooldate homelessindicatorcode migrant leavecode leavedate

*Get Spring school location
preserve
keep if occoccurrencedescr=="NORM SPRING"
keep stdpseudoid prlpreferredlocationcode schlocationcode
ren prlpreferredlocationcode prlprefcode_spring
ren schlocationcode  schcode_spring
*Drop duplicates as cannot assign to school (very few of them fortunately)
duplicates tag stdpseudoid, gen(g)
drop if g>0
drop g
save "/data_analysis/Eliso_Complete/Data/temp.dta", replace
restore

*Merge in spring school location
keep if occoccurrencedescr=="NORM FALL"
duplicates tag stdpseudoid, gen(g)
drop if g>0
drop g
merge 1:1 stdpseudoid using "/data_analysis/Eliso_Complete/Data/temp.dta"
drop if _merge==2
drop _merge
erase "/data_analysis/Eliso_Complete/Data/temp.dta"

compress
save "/data_analysis/Eliso_Complete/Data/Demographics 200`n'.dta", replace
}


*Now do years 2004-2012 (except 2006)

*For these years, data files for these are split into two: Spring and Fall
*Let's start by grabbing the only Spring variable we need: school of attendance
foreach n of numlist 4 5 7 8 9 10 11 12{
clear all

if `n'<=9{
import delimited "/data/Demographics/Demographics 200`n'Spring.txt"
}
else if `n'>=10{
import delimited "/data/Demographics/Demographics 20`n'Spring.txt"
}

keep if occoccurrencedescr=="NORM SPRING"

tostring prlpreferredlocationcode schlocationcode schreslocationcode stdzipcode, replace

keep stdpseudoid prlpreferredlocationcode schlocationcode
ren prlpreferredlocationcode prlprefcode_spring
ren schlocationcode  schcode_spring
*Drop duplicates as cannot assign to school (very few of them fortunately)
duplicates tag stdpseudoid, gen(g)
drop if g>0
drop g
save "/data_analysis/Eliso_Complete/Data/temp_`n'.dta", replace
}

*Now we grab all the demographic variables from the fall
foreach n of numlist 4 5 7 8 9 10 11 12{
clear all

if `n'<=9{
import delimited "/data/Demographics/Demographics 200`n'Fall.txt"
}
else if `n'>=10{
import delimited "/data/Demographics/Demographics 20`n'Fall.txt"
}

keep if occoccurrencedescr=="NORM FALL"

tostring prlpreferredlocationcode schlocationcode schreslocationcode stdzipcode, replace

drop trctrackcode stdgradefirstenroll stdca1stschooldate homelessindicatorcode migrant leavecode leavedate

*Merge in spring school location
keep if occoccurrencedescr=="NORM FALL"
duplicates tag stdpseudoid, gen(g)
drop if g>0
drop g
merge 1:1 stdpseudoid using "/data_analysis/Eliso_Complete/Data/temp_`n'.dta"
drop if _merge==2
drop _merge
erase "/data_analysis/Eliso_Complete/Data/temp_`n'.dta"

if `n'<=9{
save "/data_analysis/Eliso_Complete/Data/Demographics 200`n'.dta", replace
}
else if `n'>=10{
save "/data_analysis/Eliso_Complete/Data/Demographics 20`n'.dta", replace
}
}

*Now do school years 2012-13 and 2013-14*
foreach n of numlist 3 4{
clear all
import delimited "/data/Demographics/Demographics 201`n'.csv"

ren ÿþendyear schendyr 
ren occurencedescription occoccurrencedescr
ren schoolcdscode schcdscode
ren preferredlocationcode prlpreferredlocationcode
ren preferredlocationname prlpreferredlocationname
ren schoollocationcode schlocationcode 
ren trackcode trctrackcode
ren studentpseudoid stdpseudoid 
ren studentdateofbirth stddob
ren ethnicitydescription ethnicitydescr
ren languageclasscode langclscode 
ren homelanguagedescription homelangdescr
ren povertyindicatorcode frlunch
ren  studentgradefristenrolled stdgradefirstenroll
ren studentcaschoolfirstattenddate stdca1stschooldate
ren migrantindicatorcode migrant
ren ell_level_code ellevelcode
ren studentellredesignationdate stdelredesignationdate 
ren resschcode schreslocationcode
ren zipcode stdzipcode

tostring prlpreferredlocationcode schlocationcode schreslocationcode stdzipcode, replace

drop trctrackcode stdgradefirstenroll stdca1stschooldate homelessindicatorcode migrant leavecode leavedate

*Get Spring school location
preserve
if `n'==3{
keep if occoccurrencedescr=="NORM SPRING"
}
else if `n'==4{
keep if occoccurrencedescr=="END SPRING"
}
keep stdpseudoid prlpreferredlocationcode schlocationcode
ren prlpreferredlocationcode prlprefcode_spring
ren schlocationcode  schcode_spring
*Just one duplicate which is actual duplicates, so drop with force command
duplicates drop stdpseudoid, force
save "/data_analysis/Eliso_Complete/Data/temp.dta", replace
restore

*Merge in spring school location
keep if occoccurrencedescr=="NORM FALL"
merge 1:1 stdpseudoid using "/data_analysis/Eliso_Complete/Data/temp.dta"
drop if _merge==2
drop _merge
erase "/data_analysis/Eliso_Complete/Data/temp.dta"

save "/data_analysis/Eliso_Complete/Data/Demographics 201`n'.dta", replace
}




***Merge all together
clear all
foreach n of numlist 3(1)9{
append using "/data_analysis/Eliso_Complete/Data/Demographics 200`n'.dta"
}
foreach n of numlist 0(1)7{
append using "/data_analysis/Eliso_Complete/Data/Demographics 201`n'.dta"
}
*Go through each demo variable and clean
drop occoccurrencedescr stdelredesignationdate
*Gender: code as a dummy variable
gen gender=1 if gendercode=="M"
replace gender=0 if gendercode=="F"
drop gendercode 

*Ethnicity -- create missing for "unknown"
encode ethnicitydescr, gen(ethnicity)
drop ethnicitydescr
replace ethnicity=. if ethnicity==7

*Age*
gen year_1=schendyr-1
gen start_term = mdy(9, 1, year_1)
gen dob=substr(stddob,1,2) if schendyr<2013
gen mob=substr(stddob,4,3) if schendyr<2013
gen yob=substr(stddob,8,4) if schendyr<2013
replace dob=substr(stddob,9,2) if schendyr>=2013
replace mob=substr(stddob,6,2) if schendyr>=2013
replace yob=substr(stddob,1,4) if schendyr>=2013
replace mob="01" if mob=="JAN"
replace mob="02" if mob=="FEB"
replace mob="03" if mob=="MAR"
replace mob="04" if mob=="APR"
replace mob="05" if mob=="MAY"
replace mob="06" if mob=="JUN"
replace mob="07" if mob=="JUL"
replace mob="08" if mob=="AUG"
replace mob="09" if mob=="SEP"
replace mob="10" if mob=="OCT"
replace mob="11" if mob=="NOV"
replace mob="12" if mob=="DEC"
destring dob mob yob, replace
gen date_birth=mdy(mob,dob,yob)
gen age = (start_term - date_birth)/365.25
drop date_birth year_1 start_term stddob mob dob yob
*If age <4 or >20, just code as missing as surely a typo
replace age=. if age<4 | age>20

*Parental Education*
encode parentedulevelname, gen(pared)
drop parentedulevelname
*Set to missing if answered "Decline to answer"
replace pared=. if pared==2

***Dealing with grades -- only keep grade K-12***
drop if gradecode=="18+" | gradecode=="TK" | gradecode=="TE" | gradecode=="PRE K" | gradecode=="PK" | gradecode=="IN" | gradecode=="0T"
replace gradecode="0" if gradecode=="K"
replace gradecode="." if gradecode=="U" | gradecode=="UNKNOWN" | gradecode=="UNK"
destring gradecode, replace

*Keep only grade 0-8
drop if gradecode>=9 

*FRL*
encode frlunch, gen(frl)
drop frlunch

*Destring school codes and zipcode (assign to missing if "0")
replace stdzipcode="." if stdzipcode=="`9004" | stdzipcode=="`9074"
destring prlpreferredlocationcode stdzipcode schreslocationcode schlocationcode, replace i("UNKNOWNca||homel/-LOSAC.--#+@-N/A-n/a-LOS AOCA-D")
replace prlpreferredlocationcode=. if prlpreferredlocationcode==0
replace stdzipcode=. if stdzipcode==0
replace schreslocationcode=. if schreslocationcode==0
replace schlocationcode=. if schlocationcode==0

destring prlprefcode_spring schcode_spring, replace i("UNKNOWNca||homel/-LOSAC.--#+@-N/A-n/a-LOS AOCA-D")
replace prlprefcode_spring=. if prlprefcode_spring==0
replace schcode_spring=. if schcode_spring==0

ren schendyr year
ren gradecode grade
***Will use prlpreferredlocationcode as location measure (but will keep other one, schoolcode). They are 99.5% correlated (300k/5.1mil do not agree)
ren prlpreferredlocationcode locationcode

compress
save "/data_analysis/Eliso_Complete/Data/VA_demographics.dta", replace

*Erase uneeded files
foreach n of numlist 3(1)9{
erase "/data_analysis/Eliso_Complete/Data/Demographics 200`n'.dta"
}
foreach n of numlist 0(1)7{
erase "/data_analysis/Eliso_Complete/Data/Demographics 201`n'.dta"
}




***************************************************************************************************
***************************************************************************************************
***************************************************************************************************

*Now, going to combine the student demographics with the test score data. Also will merge on school characteristics.
*This will create the master analysis dataset, which will be at the student-year level.

***************************************************************************************************
***************************************************************************************************
***************************************************************************************************

clear all
set more off

use "/data_analysis/Eliso_Complete/Data/VA_demographics.dta"

*Merge in the school characteristics
merge m:1 locationcode using "/data_analysis/Eliso_Complete/Data/lausd_distance_chars_bound_data.dta"
drop if _merge==2
drop _merge

***********************************************************************************************
***********************************************************************************************
**************************Table 1**************************************************************
***********************************************************************************************
***********************************************************************************************
*Doing this table before the "distance" restriction
preserve
keep if year==2016
*Merge in test scores
merge 1:1 stdpseudoid year using "/data_analysis/Eliso_Complete/Data/VA_test_scores.dta"
drop if _merge==2
drop _merge
encode ellevelcode, gen(el_code)
replace pared=7 if pared==.
replace ethnicity=10 if ethnicity==.

*Gen demo vars
gen hispanic=(ethnicity==5)
gen white=(ethnicity==8)
gen black=(ethnicity==3)
gen asian=(ethnicity==2)
gen fl=(frl==3)
replace fl=. if frl==2
gen el=(el_code!=6)
gen dropout=(pared==5)
gen HS=(pared==4)
gen some_college=(pared==6)
gen college=(pared==1 | pared==3)
foreach var of varlist hispanic white black asian {
replace `var'=. if ethnicity==10 
}
foreach var of varlist dropout HS some_college college{ 
replace `var'=. if pared==7
}

*Column (1)
keep if grade==4 | grade==5
foreach var of varlist math_scorez ela_scorez lag_math lag_ela hispanic white black asian fl el  dropout HS college some_college{
su `var'
}
distinct locationcode
*Column (2)
drop if charter_type=="independent"
gen running_var=distance_calc-5.05
keep if running_var>=-1.75 & running_var<=1.75
foreach var of varlist math_scorez ela_scorez lag_math lag_ela hispanic white black asian fl el  dropout HS college some_college{
su `var'
}
distinct locationcode
*Column (3)
drop if math_scorez==. | lag_math==.
foreach var of varlist math_scorez ela_scorez lag_math lag_ela hispanic white black asian fl el  dropout HS college some_college{
su `var'
}
distinct locationcode

restore
***********************************************************************************************
***********************************************************************************************
***********************************************************************************************
***********************************************************************************************
***********************************************************************************************



*Create ram space by dropping anything with distance bigger than 10 miles (for some results such as the D-in-D this restriction is not made)
drop if distance_calc>10
drop nces_id name multi_lingual principalemail schcdscode

*Merge in test scores
merge 1:1 stdpseudoid year using "/data_analysis/Eliso_Complete/Data/VA_test_scores.dta"
drop if _merge==2
drop _merge

*Note testing_grade and grade are identical in RD sample
ren grade grade2
gen grade=testing_grade
replace grade=grade2 if grade==.
drop grade2

*Manually input the percent of students living within five miles according to authors calculations using 2010 census (used for a minor robustness check)
gen pct_living_5=100 if distance_calc<=5.05
replace pct_living_5=0 if distance_calc>5.05

replace pct_living_5=90.18 if locationcode==4117
replace pct_living_5=89.79 if locationcode==4762
replace pct_living_5=74.22 if locationcode==3027
replace pct_living_5=67.09 if locationcode==7007
replace pct_living_5=88.94 if locationcode==7201
replace pct_living_5=31.01 if locationcode==2117

replace pct_living_5=31.63 if locationcode==7247
replace pct_living_5=59.12 if locationcode==4515
replace pct_living_5=45.31 if locationcode==3541

*Independent charters were not eligible for air filters so drop
drop if charter_type=="independent"
drop charter_type
*Also drop the four schools that were evacuated or received evacuees
drop if evacuated==1
drop if receive_evac==1
drop evacuated receive_evac

***Okay let's save the analysis data*****
compress
save "/data_analysis/Eliso_Complete/Data/main_analysis_data.dta", replace

